Forest-based Algorithms in Natural Language Processing

نویسندگان

  • Liang Huang
  • Qun Liu
چکیده

FOREST-BASED ALGORITHMS IN NATURAL LANGUAGE PROCESSING Liang Huang Supervisors: Aravind K. Joshi and Kevin Knight Many problems in Natural Language Processing (NLP) involves an efficient search for the best derivation over (exponentially) many candidates. For example, a parser aims to find the best syntactic tree for a given sentence among all derivations under a grammar, and a machine translation (MT) decoder explores the space of all possible translations of the source-language sentence. In these cases, the concept of packed forest provides a compact representation of huge search spaces by sharing common sub-derivations, where efficient algorithms based on Dynamic Programming (DP) are possible. Building upon the hypergraph formulation of forests and well-known 1-best DP algorithms, this dissertation develops fast and exact k-best DP algorithms on forests, which are orders of magnitudes faster than previously used methods on state-of-the-art parsers. We also show empirically how the improved output of our algorithms has the potential to improve results from parse reranking systems and other applications. We then extend these algorithms to approximate search when the forests are too big for exact inference. We discuss two particular instances of this new method, forest rescoring for MT decoding, and forest reranking for parsing. In both cases, our methods perform orders of magnitudes faster than conventional approaches. In the latter, faster search also leads to better learning, where our approximate decoding makes whole-Treebank discriminative training practical and results in an accuracy better than any previously reported systems trained on the Treebank. Finally, we apply the above materials to the problem of syntax-based translation and propose a new paradigm, forest-based translation. This scheme translates a packed forest of the source sentence into a target sentence, rather than just using 1-best or k-best parses as in usual practice. By considering exponentially many alternatives, it alleviates the propogation of parsing errors into translation, yet only comes with fractional overhead in running time. We also push this direction further to extract translation rules from packed

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Forest Stand Types Classification Using Tree-Based Algorithms and SPOT-HRG Data

Forest types mapping, is one of the most necessary elements in the forest management and silviculture treatments. Traditional methods such as field surveys are almost time-consuming and cost-intensive. Improvements in remote sensing data sources and classification –estimation methods are preparing new opportunities for obtaining more accurate forest biophysical attributes maps. This research co...

متن کامل

روش جدید متن‌کاوی برای استخراج اطلاعات زمینه کاربر به‌منظور بهبود رتبه‌بندی نتایج موتور جستجو

Today, the importance of text processing and its usages is well known among researchers and students. The amount of textual, documental materials increase day by day. So we need useful ways to save them and retrieve information from these materials. For example, search engines such as Google, Yahoo, Bing and etc. need to read so many web documents and retrieve the most similar ones to the user ...

متن کامل

A Search in the Forest: Efficient Algorithms for Parsing and Machine Translation based on Packed Forests A DISSERTATION PROPOSAL in Computer and Information Science

Many problems in Natural Language Processing (NLP) involves an efficient search for the best derivation over (exponentially) many candidates. For example, a parser aims to find the best syntactic tree for a given sentence among all derivations under a grammar, and a machine translation (MT) decoder explores the space of all possible translations of the source-language sentence. In these cases, ...

متن کامل

ADABOOST ENSEMBLE ALGORITHMS FOR BREAST CANCER CLASSIFICATION

With an advance in technologies, different tumor features have been collected for Breast Cancer (BC) diagnosis, processing of dealing with large data set suffers some challenges which include high storage capacity and time require for accessing and processing. The objective of this paper is to classify BC based on the extracted tumor features. To extract useful information and diagnose the tumo...

متن کامل

A Hybrid Optimization Algorithm for Learning Deep Models

Deep learning is one of the subsets of machine learning that is widely used in Artificial Intelligence (AI) field such as natural language processing and machine vision. The learning algorithms require optimization in multiple aspects. Generally, model-based inferences need to solve an optimized problem. In deep learning, the most important problem that can be solved by optimization is neural n...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008